What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made, and show any major changes to your ideas. How did you reach these conclusions?
library(tidyverse)
data <- read.csv("world_clean.csv")
crude_birth_rate_perK)We are trying to predict the birth rate in a country using socioeconomic features and controlling by region. Main variables we are using are:
years_in_school)
gdp)
any_method)
mean_age_of_childbearing)
region_l)
score)
We are going to perform two types of analyses:
We can appreciate a statistically significant negative association, which is bolstering our hypothesis. Also, the residual errors do not show a pattern that might alert us of any bias.
# fit the model
lm_birth_rate_predicted_by_years_in_school <-
lm(crude_birth_rate_perK ~ years_in_school , data = data)
# get the coefficient table
summary(lm_birth_rate_predicted_by_years_in_school)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.687403 1.2139104 33.51763 2.522863e-77
## years_in_school -2.231353 0.1273078 -17.52724 4.068120e-40
# make a residual plot
plot(lm_birth_rate_predicted_by_years_in_school$residuals,
pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")
We also can appreciate a statistically significant negative association, which is bolstering our hypothesis that Women with a better income have access to better medical attention and family planning.But also, the effect is very small compared with education (-3.e-4 vs -2, v.gr. 10,000 smaller). Additionally, the residual errors do not show a pattern that might alert us of any bias.
# fit the model
lm_birth_rate_predicted_by_gdp <-
lm(crude_birth_rate_perK ~ gdp , data = data)
# get the coefficient table
summary(lm_birth_rate_predicted_by_gdp)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.5893416583 7.440081e-01 33.049830 8.881464e-80
## gdp -0.0003102258 3.329204e-05 -9.318316 3.443326e-17
# make a residual plot
plot(lm_birth_rate_predicted_by_gdp$residuals,
pch = 16, col = "violet", xlab = "Observations", ylab = "Residuals")
We also can appreciate a statistically significant negative association and see that the effect is higher than gdp but lower than education and could be related with the fact that the use of a contraceptive method does not guarantee unplanned children if it is not used in a right way but a higher level of education will imply an effective use (let’s see how those 2 variables correlate in the correlation matrix). Also, the residual errors do not show a pattern that might alert us of any bias.
# fit the model
lm_birth_rate_predicted_by_any_method <-
lm(crude_birth_rate_perK ~ any_method , data = data)
# get the coefficient table
summary(lm_birth_rate_predicted_by_any_method)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.6168928 1.35037556 28.59715 3.034824e-66
## any_method -0.3329505 0.02408666 -13.82302 1.836519e-29
# make a residual plot
plot(lm_birth_rate_predicted_by_any_method$residuals,
pch = 16, col = "green", xlab = "Observations", ylab = "Residuals")
We see a negative association, which makes sense as the older a woman start having children, it is less probable that she will have many kids, but the association is not as robust as the previous variables. Also, the residuals seem to be loaded to low values (it is not uniform as it should be). Let’s use this variable with precautions because it can have endogenity.
# fit the model
lm_birth_rate_predicted_by_mean_age_of_childbearing <-
lm(crude_birth_rate_perK ~ mean_age_of_childbearing , data = data)
# get the coefficient table
summary(lm_birth_rate_predicted_by_mean_age_of_childbearing)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.0714324 13.1807533 3.191884 0.001641682
## mean_age_of_childbearing -0.7492439 0.4518191 -1.658283 0.098827863
# make a residual plot
plot(lm_birth_rate_predicted_by_mean_age_of_childbearing$residuals,
pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")
The abortion legality score is statistically positive. Notwithstanding, the residuals do not look uniform so we should use this variable with caution.
# fit the model
lm_birth_rate_predicted_by_score <- lm(crude_birth_rate_perK ~ score , data = data)
# get the coefficient table
summary(lm_birth_rate_predicted_by_score)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.876226 0.9608522 16.523068 3.382000e-38
## score 2.152454 0.3140607 6.853622 1.051584e-10
# make a residual plot
plot(lm_birth_rate_predicted_by_score$residuals,
pch = 16, col = "brown", xlab = "Observations", ylab = "Residuals")
Regions have different effects on child birth, but for some cases the effect is not statistically significant.
# fit the model
lm_birth_rate_predicted_by_regionL <- lm(crude_birth_rate_perK ~ region_l , data = data)
# get the coefficient table
summary(lm_birth_rate_predicted_by_regionL)$coeff
## Estimate Std. Error t value
## (Intercept) 12.757000 3.918524 3.2555628
## region_lCentral and Southern Asia 9.242714 4.189078 2.2063839
## region_lEastern and South-eastern Asia 2.850368 4.119603 0.6919037
## region_lEurope and Northern America -2.474833 4.010737 -0.6170520
## region_lLatin America and the Caribbean 3.647316 4.020320 0.9072202
## region_lNorthern Africa and Western Asia 6.876040 4.072249 1.6885116
## region_lOther Oceania 10.693000 4.232489 2.5264092
## region_lSub-saharan Africa 19.876120 3.996126 4.9738475
## Pr(>|t|)
## (Intercept) 1.335460e-03
## region_lCentral and Southern Asia 2.853080e-02
## region_lEastern and South-eastern Asia 4.898250e-01
## region_lEurope and Northern America 5.379239e-01
## region_lLatin America and the Caribbean 3.654159e-01
## region_lNorthern Africa and Western Asia 9.291981e-02
## region_lOther Oceania 1.232246e-02
## region_lSub-saharan Africa 1.441194e-06
# make a residual plot
plot(lm_birth_rate_predicted_by_regionL$residuals,
pch = 16, col = "cyan", xlab = "Observations", ylab = "Residuals")
Because of this, it is sensible to use dummy coded variables for those regions where the effect is robust such as Sub-Saharan Africa, Central and Southern Asia, and Other Oceania.
# fit the model
lm_birth_rate_predicted_by_regionL <-
lm(crude_birth_rate_perK ~ r_Subsaharan_Africa + r_OtherOceania + r_CS.Asia , data = data)
# get the coefficient table
summary(lm_birth_rate_predicted_by_regionL)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.826151 0.5489264 27.009359 2.571208e-68
## r_Subsaharan_Africa 17.806969 1.0298772 17.290380 1.978141e-41
## r_OtherOceania 8.623849 1.8615000 4.632742 6.531027e-06
## r_CS.Asia 7.173563 1.7358577 4.132576 5.292316e-05
# make a residual plot
plot(lm_birth_rate_predicted_by_regionL$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
# select the variable we need
linear_model_variables <- data %>%
select(mean_age_of_childbearing, any_method, gdp, years_in_school, score)
# write a function to deal with NAs
NA2mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
# apply the function to the data frame
replaced_nulls <- replace(linear_model_variables, TRUE,
lapply(linear_model_variables, NA2mean))
# get the correlation matrix
cor(replaced_nulls)
## mean_age_of_childbearing any_method gdp
## mean_age_of_childbearing 1.00000000 -0.1232125 0.3153997
## any_method -0.12321247 1.0000000 0.2518059
## gdp 0.31539967 0.2518059 1.0000000
## years_in_school 0.08497914 0.5971134 0.3338028
## score -0.01238248 -0.3394830 -0.2139738
## years_in_school score
## mean_age_of_childbearing 0.08497914 -0.01238248
## any_method 0.59711345 -0.33948304
## gdp 0.33380275 -0.21397378
## years_in_school 1.00000000 -0.40091731
## score -0.40091731 1.00000000
After estimating the correlation matrix of all numeric variables (we are excluding the regions), the variable any_method is highly correlated with years of education, gdp and score, but let’s see how the features interact in the model.
# fit the model
lm_birth_rate_all_features <-
lm(crude_birth_rate_perK ~ r_Subsaharan_Africa+ r_OtherOceania + r_CS.Asia +
mean_age_of_childbearing + any_method + gdp + years_in_school + score,
data = data)
# get the coefficient table
summary(lm_birth_rate_all_features)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.325162e+01 9.119563e+00 4.7427293 4.967604e-06
## r_Subsaharan_Africa 9.708195e+00 1.111123e+00 8.7372846 5.142354e-15
## r_OtherOceania 7.007703e+00 2.034647e+00 3.4441870 7.478515e-04
## r_CS.Asia 2.325165e+00 1.400901e+00 1.6597640 9.910836e-02
## mean_age_of_childbearing -4.676312e-01 3.013162e-01 -1.5519614 1.228365e-01
## any_method -9.882125e-02 2.443069e-02 -4.0449637 8.446670e-05
## gdp -1.803369e-05 3.321743e-05 -0.5428986 5.880275e-01
## years_in_school -8.644171e-01 1.633836e-01 -5.2907206 4.373005e-07
## score 6.433929e-01 2.090061e-01 3.0783449 2.486984e-03
# make a residual plot
plot(lm_birth_rate_all_features$residuals, pch = 16, col = "pink",
xlab = "Observations", ylab = "Residuals")
gdp and mean_age_of_childbearing are not statistically significant. Let’s remove them.
# fit the model
lm_birth_rate_all_features <-
lm(crude_birth_rate_perK ~ r_Subsaharan_Africa+ r_OtherOceania + r_CS.Asia+ any_method+years_in_school+score, data = data)
# get the coefficient table
summary(lm_birth_rate_all_features)$coeff
# plot the residuals
plot(lm_birth_rate_all_features$residuals, pch = 16, col = "purple",
xlab = "Observations", ylab = "Residuals")
After dropping gdp and mean_age_of_childbearing, we have all features as statically significant.
infant_mortality)We are trying to predict the infant mortality rate in a country using socioeconomic features and controlling by region. Main variables we are using are:
years_in_school)
gdp)
any_method)
mean_age_of_childbearing)
region_l)
score)
We are going to perform two types of analyses:
The results are robust sustaining the hypothesis that more years of school strongly hinder child mortality and we do not see patterns on the residuals.
# fit the model
lm_infant_mortality_predicted_by_years_in_school <-
lm(infant_mortality ~ years_in_school , data = data)
# get the coefficient table
summary(lm_infant_mortality_predicted_by_years_in_school)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.914704 2.4292821 24.66354 3.301987e-58
## years_in_school -4.354876 0.2556391 -17.03525 1.103044e-38
# make a residual plot
plot(lm_infant_mortality_predicted_by_years_in_school$residuals,
pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")
The results are also robust sustaining the hypothesis that a better income hinders child mortality but the effect is very low (-4e-4) and we do not see patterns on the residuals.
# fit the model
lm_infant_mortality_predicted_by_gdp <-
lm(infant_mortality ~ gdp , data = data)
# get the coefficient table
summary(lm_infant_mortality_predicted_by_gdp)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.7129193939 1.480034e+00 18.048852 7.994141e-43
## gdp -0.0004178504 5.784906e-05 -7.223114 1.246609e-11
# make a residual plot
plot(lm_infant_mortality_predicted_by_gdp$residuals,
pch = 16, col = "violet", xlab = "Observations", ylab = "Residuals")
The results are also robust sustaining the hypothesis that using contraceptive methods hinders child mortality because parents will have planned children and we do not see patterns on the residuals.
# fit the model
lm_infant_mortality_predicted_by_any_method <-
lm(infant_mortality ~ any_method , data = data)
# get the correlation table
summary(lm_infant_mortality_predicted_by_any_method)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.328728 2.79659552 19.42674 1.703507e-44
## any_method -0.620588 0.05083354 -12.20824 7.808829e-25
# make a residual plot
plot(lm_infant_mortality_predicted_by_any_method$residuals,
pch = 16, col = "green", xlab = "Observations", ylab = "Residuals")
We see a negative effect but it is not statistically significant. Also, residuals are negatively loaded (we should not blindly trust on using it).
# fit the model
lm_infant_mortality_predicted_by_mean_age_of_childbearing <-
lm(infant_mortality ~ mean_age_of_childbearing , data = data)
# get the correlation table
summary(lm_infant_mortality_predicted_by_mean_age_of_childbearing)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.696984 27.2926651 2.553689 0.01147853
## mean_age_of_childbearing -1.665865 0.9368675 -1.778122 0.07705340
# make a residual plot
plot(lm_infant_mortality_predicted_by_mean_age_of_childbearing$residuals,
pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")
The abortion legality score is statistically positive.
# fit the model
lm_infant_mortality_predicted_by_score <-
lm(infant_mortality ~ score , data = data)
# get the correlation table
summary(lm_infant_mortality_predicted_by_score)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.478451 1.9625691 6.867759 8.921706e-11
## score 3.181517 0.6254993 5.086365 8.683561e-07
# make a residual plot
plot(lm_infant_mortality_predicted_by_score$residuals,
pch = 16, col = "cyan", xlab = "Observations", ylab = "Residuals")
Regions have different effects on child mortality, but for some cases the effect is not statistically significant.
# fit the model
lm_infant_mortality_predicted_by_regionL <-
lm(infant_mortality ~ region_l , data = data)
# get the correlation table
summary(lm_infant_mortality_predicted_by_regionL)$coeff
## Estimate Std. Error t value
## (Intercept) 3.4950000 8.385827 0.41677464
## region_lCentral and Southern Asia 20.5928571 8.964826 2.29707268
## region_lEastern and South-eastern Asia 12.4575000 8.894513 1.40058264
## region_lEurope and Northern America 0.3640244 8.587924 0.04238794
## region_lLatin America and the Caribbean 11.0162903 8.652110 1.27324903
## region_lNorthern Africa and Western Asia 10.1897826 8.742829 1.16550177
## region_lOther Oceania 18.8361111 9.270881 2.03174996
## region_lSub-saharan Africa 41.9991667 8.558749 4.90716198
## Pr(>|t|)
## (Intercept) 6.773511e-01
## region_lCentral and Southern Asia 2.279211e-02
## region_lEastern and South-eastern Asia 1.630995e-01
## region_lEurope and Northern America 9.662375e-01
## region_lLatin America and the Caribbean 2.046088e-01
## region_lNorthern Africa and Western Asia 2.453931e-01
## region_lOther Oceania 4.368259e-02
## region_lSub-saharan Africa 2.094960e-06
# make a residual plot
plot(lm_infant_mortality_predicted_by_regionL$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
Because of this, it is sensible to use a dummy for those regions where the effect is robust such as: Sub-Saharan Africa, Central and Southern Asia, and Other Oceania (the same ones we chose for child birth rate but only Sub-Saharan Africa is strongly robust).
# fit the model
lm_infant_mortality_predicted_by_regionL <-
lm(infant_mortality ~ r_Subsaharan_Africa+ r_OtherOceania + r_CS.Asia ,
data = data)
# get the correlation table
summary(lm_infant_mortality_predicted_by_regionL)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.48717 1.170505 8.959524 4.072181e-16
## r_Subsaharan_Africa 35.00700 2.143708 16.330116 2.288702e-37
## r_OtherOceania 11.84394 4.309550 2.748301 6.600702e-03
## r_CS.Asia 13.60069 3.525422 3.857889 1.591671e-04
# make a residual plot
plot(lm_infant_mortality_predicted_by_regionL$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
# fit the model
lm_infant_mortality_all_features <-
lm(infant_mortality ~ r_Subsaharan_Africa + r_OtherOceania + r_CS.Asia +
mean_age_of_childbearing + any_method + gdp + years_in_school + score,
data = data)
# get the correlation table
summary(lm_infant_mortality_all_features)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.721596e+01 2.028072e+01 4.3004377 3.114477e-05
## r_Subsaharan_Africa 1.752760e+01 2.469567e+00 7.0974399 5.225230e-11
## r_OtherOceania 3.392868e+00 4.535726e+00 0.7480320 4.556524e-01
## r_CS.Asia 2.839746e+00 3.113812e+00 0.9119837 3.632910e-01
## mean_age_of_childbearing -1.447077e+00 6.697831e-01 -2.1605164 3.237556e-02
## any_method -2.545828e-01 5.429228e-02 -4.6891172 6.274800e-06
## gdp 3.769402e-05 7.397471e-05 0.5095528 6.111394e-01
## years_in_school -1.918339e+00 3.652613e-01 -5.2519619 5.261817e-07
## score 5.857769e-01 4.690655e-01 1.2488167 2.137446e-01
# make a residual plot
plot(lm_infant_mortality_all_features$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
gdp is not statistically significant. Let’s remove it first before removing the other ones that also are not significant.
# fit the model
lm_infant_mortality_selected_features_1 <-
lm(infant_mortality ~ r_Subsaharan_Africa + r_OtherOceania + r_CS.Asia +
mean_age_of_childbearing + any_method + years_in_school + score,
data = data)
# get the correlation table
summary(lm_infant_mortality_selected_features_1)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.2512359 16.72203773 4.9187328 2.277793e-06
## r_Subsaharan_Africa 17.2922968 2.47633464 6.9830210 8.910427e-11
## r_OtherOceania 3.3460145 4.50658733 0.7424719 4.589704e-01
## r_CS.Asia 2.9124843 3.13504016 0.9290102 3.543864e-01
## mean_age_of_childbearing -1.3188009 0.54553363 -2.4174512 1.683752e-02
## any_method -0.2324605 0.05319905 -4.3696365 2.320696e-05
## years_in_school -1.8681163 0.35528540 -5.2580723 4.964988e-07
## score 0.6034146 0.45632607 1.3223322 1.880838e-01
# make a residual plot
plot(lm_infant_mortality_selected_features_1$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
After dropping gdp, dummy regions of Oceania and Central & South Asia are not statistically significant. Let’s drop them.
# fit the model
lm_infant_mortality_selected_features_2 <-
lm(infant_mortality ~ r_Subsaharan_Africa + mean_age_of_childbearing +
any_method + years_in_school + score, data = data)
# get the correlation table
summary(lm_infant_mortality_selected_features_2)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.6859785 15.63956153 5.542737 1.294874e-07
## r_Subsaharan_Africa 16.2808837 2.30026215 7.077838 5.142701e-11
## mean_age_of_childbearing -1.4176880 0.52308599 -2.710239 7.501791e-03
## any_method -0.2514928 0.05009273 -5.020545 1.434175e-06
## years_in_school -1.8663711 0.34485164 -5.412099 2.396220e-07
## score 0.6286242 0.43781274 1.435829 1.531194e-01
# make a residual plot
plot(lm_infant_mortality_selected_features_2$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
After dropping Oceania and Central & South Asia, abortion legality score remains as non-relevant. Let’s remove it from the last model.
# fit the model
lm_infant_mortality_selected_features_3 <-
lm(infant_mortality ~ r_Subsaharan_Africa + mean_age_of_childbearing +
any_method + years_in_school, data = data)
# get the correlation table
summary(lm_infant_mortality_selected_features_3)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.0329231 15.51878221 5.801546 3.690442e-08
## r_Subsaharan_Africa 15.6359604 2.26384699 6.906810 1.268724e-10
## mean_age_of_childbearing -1.4087263 0.52487213 -2.683942 8.082807e-03
## any_method -0.2603678 0.04988321 -5.219547 5.797571e-07
## years_in_school -2.0469299 0.32222530 -6.352480 2.332289e-09
# make a residual plot
plot(lm_infant_mortality_selected_features_3$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
After dropping abortion law score, years in school, mean age of childbearing and use of contraceptive methods reduce infant mortality and being in a country from Sub-Saharan Africa increases it drastically.
We are trying to predict the HIV prevalence in a country using socioeconomic features and controlling by region. Main variables we are using are:
years_in_school)
gdp)
any_method)
barrier)
region_l)
score)
We are going to perform two types of analyses:
Note: As HIV prevalence is not reported by all countries, let’s see how many countries have a null value
sum(is.na(data$HIV_prev))
## [1] 77
77 Countries did not report HIV prevalence. Given this, let’s use only countries reporting it and be aware that reporting might play as a bias.
data_2 <- data %>% filter(HIV_prev > -1)
Years in School does not seem to impact HIV prevalence by itself. Also, residuals are highly loaded to a negative value (let’s be aware of using it to predict HIV prevalence).
# fit the model
lm_HIV_prev_predicted_by_years_in_school <-
lm(HIV_prev ~ years_in_school, data = data)
# get the correlation table
summary(lm_HIV_prev_predicted_by_years_in_school)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.163907 0.88478812 3.575892 0.0004728503
## years_in_school -0.150590 0.09449427 -1.593642 0.1131642202
# make a residual plot
plot(lm_HIV_prev_predicted_by_years_in_school$residuals,
pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")
GDP shows a negative impact but the impact is not robust. Also, we see that residuals are far from being uniform.
# fit the model
lm_HIV_prev_predicted_by_gdp <- lm(HIV_prev ~ gdp , data = data_2)
# get the correlation table
summary(lm_HIV_prev_predicted_by_gdp)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.435442e+00 4.094515e-01 5.948060 1.835189e-08
## gdp -4.441732e-05 1.779406e-05 -2.496188 1.363523e-02
# make a residual plot
plot(lm_HIV_prev_predicted_by_gdp$residuals,
pch = 16, col = "violet", xlab = "Observations", ylab = "Residuals")
Any method not seem to impact HIV prevalence by itself. Also, residuals are highly loaded to a negative value. It seems that the HIV prevalence might have an error.
# fit the model
lm_HIV_prev_predicted_by_any_method <- lm(HIV_prev ~ any_method , data = data_2)
# get the correlation table
summary(lm_HIV_prev_predicted_by_any_method)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.51657153 0.93688942 2.6860924 0.008103997
## any_method -0.01066428 0.01643212 -0.6489899 0.517408125
# make a residual plot
plot(lm_HIV_prev_predicted_by_any_method$residuals,
pch = 16, col = "green", xlab = "Observations", ylab = "Residuals")
Residuals remain as not uniform and the effect is not robust.
# fit the model
lm_HIV_prev_predicted_by_barrier <- lm(HIV_prev ~ barrier , data = data_2)
# get the correlation table
summary(lm_HIV_prev_predicted_by_barrier)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5648087 0.45700411 3.4240582 0.0007957091
## barrier 0.0326779 0.03384156 0.9656143 0.3357914411
# make a residual plot
plot(lm_HIV_prev_predicted_by_barrier$residuals,
pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")
Residuals remain as not uniform and the effect is not robust.
# fit the model
lm_HIV_prev_predicted_by_score <- lm(HIV_prev ~ score, data = data_2)
# get the correlation table
summary(lm_HIV_prev_predicted_by_score)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.70694645 0.4866471 3.5075655 0.000594908
## score 0.06230606 0.1674310 0.3721298 0.710314483
# make a residual plot
plot(lm_HIV_prev_predicted_by_score$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
Not a single region explains HIV prevalence.
# fit the model
lm_HIV_prev_predicted_by_regionL <- lm(HIV_prev ~ region_l , data = data_2)
# get the correlation table
summary(lm_HIV_prev_predicted_by_regionL)$coeff
## Estimate Std. Error t value
## (Intercept) 0.150000000 2.582546 0.058082218
## region_lCentral and Southern Asia 0.038461538 2.774099 0.013864513
## region_lEastern and South-eastern Asia 0.180769231 2.774099 0.065163213
## region_lEurope and Northern America 0.171081081 2.651426 0.064524177
## region_lLatin America and the Caribbean 0.664814815 2.676487 0.248390799
## region_lNorthern Africa and Western Asia -0.004285714 2.760858 -0.001552313
## region_lOther Oceania 0.250000000 3.652272 0.068450550
## region_lSub-saharan Africa 5.047826087 2.638091 1.913438963
## Pr(>|t|)
## (Intercept) 0.95376256
## region_lCentral and Southern Asia 0.98895700
## region_lEastern and South-eastern Asia 0.94813319
## region_lEurope and Northern America 0.94864112
## region_lLatin America and the Caribbean 0.80418109
## region_lNorthern Africa and Western Asia 0.99876355
## region_lOther Oceania 0.94552064
## region_lSub-saharan Africa 0.05764911
# make a residual plot
plot(lm_HIV_prev_predicted_by_regionL$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
We startred from the full model:
# fit the model
lm_HIV_prev_all_features <- lm(HIV_prev ~ barrier + any_method + gdp +
years_in_school + score, data = data)
# get the correlation table
summary(lm_HIV_prev_all_features)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.539335e+00 1.403389e+00 2.5219905 0.01286716
## barrier 1.049583e-01 5.016233e-02 2.0923738 0.03833489
## any_method 5.929340e-03 2.469437e-02 0.2401090 0.81062106
## gdp -5.695014e-05 2.721358e-05 -2.0927106 0.03830426
## years_in_school -2.370405e-01 1.753000e-01 -1.3521987 0.17864153
## score -1.184903e-01 2.203071e-01 -0.5378413 0.59159920
# make a residual plot
plot(lm_HIV_prev_all_features$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
any_method and score are not significant in the full model so we removed them.
# fit the model
lm_HIV_prev_all_features <- lm(HIV_prev ~ barrier + gdp + years_in_school, data = data)
# get the correlation table
summary(lm_HIV_prev_all_features)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.114311e+00 9.400400e-01 3.312956 0.001171199
## barrier 9.685712e-02 4.151207e-02 2.333228 0.021040042
## gdp -4.703231e-05 2.426076e-05 -1.938617 0.054531568
## years_in_school -1.770056e-01 1.321507e-01 -1.339422 0.182573208
# make a residual plot
plot(lm_HIV_prev_all_features$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
yeaers_in_school is not significant so we removed it.
# fit the model
lm_HIV_prev_all_features <- lm(HIV_prev ~ barrier + gdp, data = data)
# get the correlation table
summary(lm_HIV_prev_all_features)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.026605e+00 0.475240271 4.264379 3.569819e-05
## barrier 6.827073e-02 0.035445320 1.926086 5.602366e-02
## gdp -6.293099e-05 0.000020753 -3.032381 2.868845e-03
# make a residual plot
plot(lm_HIV_prev_all_features$residuals,
pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")
Not a single variable explains HIV prevalence which is counter-intuitive and is a strong signal that the data might be wrong.